```mermaid
graph LR
A["Existing Doc Benchmarks<br/>Limited doc types<br/>Simplified evaluation"] --> B["Unfair comparisons<br/>between models"]
B --> C["OmniDocBench 1.5<br/>1,355 pages · 9 doc types<br/>Multi-level evaluation"]
C --> D["Fair, fine-grained<br/>document parsing<br/>evaluation"]
style A fill:#e74c3c,stroke:#333,color:#fff
style B fill:#f39c12,stroke:#333,color:#fff
style C fill:#27ae60,stroke:#333,color:#fff
style D fill:#3498db,stroke:#333,color:#fff
```
OmniDocBench 1.5
A comprehensive benchmark for evaluating diverse PDF document parsing — covering text OCR, table recognition, formula extraction, and layout detection across 1,355 real-world pages
Keywords: OmniDocBench, document parsing benchmark, PDF parsing, OCR evaluation, table recognition, formula recognition, layout detection, CVPR 2025, OpenDataLab, Shanghai AI Laboratory, document understanding, VLM evaluation, pipeline evaluation, TEDS, CDM, edit distance

Introduction
Large language models and RAG systems are only as good as the documents they can parse. Yet the field has lacked a fair, comprehensive benchmark for how accurately AI can extract text, tables, formulas, and layout from real-world PDFs: academic papers, financial reports, handwritten notes, newspapers.
OmniDocBench fills this gap. It is a rigorously annotated benchmark spanning 1,355 PDF pages across 9 document types, 4 layout types, and 3 languages, with over 20,000 block-level and 80,000 span-level annotations. Version 1.5 (September 2025) expanded the dataset with 374 new pages, balanced Chinese/English coverage, and introduced an improved evaluation methodology.
“Despite recent progress, current document parsing methods have not been fairly and comprehensively evaluated due to the narrow coverage of document types and the simplified, unrealistic evaluation procedures in existing benchmarks.” — OmniDocBench Paper
What Is OmniDocBench?
OmniDocBench is a benchmark for evaluating diverse PDF document parsing in real-world scenarios. It assesses how well AI systems can convert complex PDF pages into structured, machine-readable output (typically Markdown), covering text extraction, table recognition, formula parsing, layout detection, and reading order.
Key Characteristics
| Feature | Details |
|---|---|
| Total pages | 1,355 PDF pages (v1.5) |
| Document types | 9 — academic papers, textbooks, financial reports, newspapers, handwritten notes, PPTs, magazines, test papers, books |
| Layout types | 4 — single-column, double-column, three-column, complex |
| Languages | 3 — English, Chinese, mixed |
| Block-level annotations | 15 categories (text paragraphs, headings, tables, etc.) — over 20,000 |
| Span-level annotations | 4 categories (text lines, inline formulas, subscripts, etc.) — over 80,000 |
| Table annotations | Both LaTeX and HTML formats |
| Formula annotations | LaTeX format with language attributes |
| Reading order | Full reading-order annotations for document components |
| Attribute labels | 5 page-level + 3 text-level + 6 table-level attribute tags |
| License | Apache-2.0 |
| Accepted at | CVPR 2025 |
What Makes It Comprehensive?
Unlike narrow benchmarks that focus on a single document type or a single extraction task, OmniDocBench evaluates five distinct capabilities across diverse, real-world documents:
```mermaid
graph TD
ODB["OmniDocBench 1.5<br/>1,355 PDF pages"] --> E2E["End-to-End<br/>Document Parsing"]
ODB --> OCR["Text OCR<br/>Recognition"]
ODB --> TAB["Table<br/>Recognition"]
ODB --> FORM["Formula<br/>Recognition"]
ODB --> LAY["Layout<br/>Detection"]
E2E --> M1["Edit Distance · BLEU<br/>METEOR · TEDS · CDM"]
OCR --> M2["Normalized<br/>Edit Distance"]
TAB --> M3["TEDS<br/>(Tree Edit Distance)"]
FORM --> M4["CDM<br/>(Character Detection Matching)"]
LAY --> M5["COCODet<br/>(mAP, mAR)"]
style ODB fill:#e74c3c,color:#fff,stroke:#333
style E2E fill:#3498db,color:#fff,stroke:#333
style OCR fill:#27ae60,color:#fff,stroke:#333
style TAB fill:#f39c12,color:#fff,stroke:#333
style FORM fill:#8e44ad,color:#fff,stroke:#333
style LAY fill:#e67e22,color:#fff,stroke:#333
style M1 fill:#ecf0f1,color:#333,stroke:#bdc3c7
style M2 fill:#ecf0f1,color:#333,stroke:#bdc3c7
style M3 fill:#ecf0f1,color:#333,stroke:#bdc3c7
style M4 fill:#ecf0f1,color:#333,stroke:#bdc3c7
style M5 fill:#ecf0f1,color:#333,stroke:#bdc3c7
```
Version 1.5 Updates (September 2025)
OmniDocBench v1.5 introduced several important improvements over v1.0:
- +374 new pages — balanced Chinese/English page counts and increased formula-rich pages
- Higher resolution — newspaper and note images upgraded from 72 DPI to 200 DPI
- Improved matching algorithm — formulas and text can now be matched with each other, reducing score errors from Unicode formula outputs
- Simplified Overall metric — now calculated as: $\text{Overall} = \frac{(1 - \text{Text Edit Distance}) \times 100 + \text{Table TEDS} + \text{Formula CDM}}{3}$
- Language attributes for formulas — 68 Chinese + 982 English formulas
- Inline formulas increased from 353 to 1,050
Who Built It?
OmniDocBench was developed by researchers at OpenDataLab, Shanghai Artificial Intelligence Laboratory (Shanghai AI Laboratory). The authors are:
- Linke Ouyang, Yuan Qu, Hongbin Zhou, Jiawei Zhu, Rui Zhang, Qunshu Lin, Bin Wang, Zhiyuan Zhao, Man Jiang, Xiaomeng Zhao, Jin Shi, Fan Wu, Pei Chu, Minghao Liu, Zhenxiang Li, Chao Xu, Bo Zhang, Botian Shi, Zhongying Tu, Conghui He
The project was published at CVPR 2025 (IEEE/CVF Conference on Computer Vision and Pattern Recognition), one of the top-tier venues in computer vision.
| Resource | Link |
|---|---|
| arXiv paper | arxiv.org/abs/2412.07626 |
| GitHub | github.com/opendatalab/OmniDocBench |
| Official site | opendatalab.com/omnidocbench |
What Skills Does It Test?
OmniDocBench evaluates the full spectrum of document understanding capabilities:
| Capability | What It Tests | Metric |
|---|---|---|
| End-to-end parsing | Full-page PDF-to-Markdown conversion — text, tables, formulas, reading order combined | Overall (composite), Edit Distance |
| Text OCR | Accurate recognition of text paragraphs across languages, fonts, and layouts | Normalized Edit Distance |
| Table recognition | Structural and content extraction of tables (simple, complex, merged cells) | TEDS (Tree Edit Distance Similarity) |
| Formula recognition | Correct LaTeX transcription of display and inline formulas | CDM (Character Detection Matching) |
| Layout detection | Localization and classification of document components (text, tables, figures, etc.) | COCODet metrics (mAP, mAR) |
| Reading order | Correct sequencing of document elements for downstream processing | Edit Distance |
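Several of these metrics reduce to character-level edit distance. As a minimal sketch (normalizing by the longer string's length, a common convention — the benchmark's exact normalization may differ):

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert/delete/substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def normalized_edit_distance(pred: str, ref: str) -> float:
    """Edit distance scaled to [0, 1]; 0 means a perfect match."""
    if not pred and not ref:
        return 0.0
    return levenshtein(pred, ref) / max(len(pred), len(ref))

print(normalized_edit_distance("kitten", "sitting"))  # 3 edits / 7 chars ≈ 0.4286
```

Because the score is normalized, a one-character slip in a long paragraph costs far less than the same slip in a short heading, which is why per-category breakdowns matter.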
Three Categories of Models Evaluated
OmniDocBench evaluates three fundamentally different approaches to document parsing:
```mermaid
graph LR
A["Specialized VLMs<br/>PaddleOCR-VL, MinerU,<br/>MonkeyOCR, Dolphin"] --> D["End-to-End<br/>Leaderboard"]
B["General VLMs<br/>Qwen3-VL, Gemini,<br/>GPT-4o, InternVL"] --> D
C["Pipeline Tools<br/>PP-StructureV3, Marker,<br/>MinerU-pipeline"] --> D
style A fill:#3498db,color:#fff,stroke:#333
style B fill:#27ae60,color:#fff,stroke:#333
style C fill:#8e44ad,color:#fff,stroke:#333
style D fill:#e74c3c,color:#fff,stroke:#333
```
Current Leaderboard
The leaderboard below shows the end-to-end document parsing results on OmniDocBench v1.5. The Overall score is the composite metric: $\frac{(1 - \text{Edit Dist}) \times 100 + \text{TEDS} + \text{CDM}}{3}$. Higher Overall is better; lower Edit Distance is better.
Source: OmniDocBench GitHub Repository (consulted March 29, 2026). Dataset version 1.5 (September 2025).
Specialized Document VLMs
| Rank | Model | Size | Overall ↑ | Edit Dist ↓ | Table TEDS ↑ | Formula CDM ↑ |
|---|---|---|---|---|---|---|
| 1 | PaddleOCR-VL | 0.9B | 92.86 | 0.035 | 90.89 | 94.76 |
| 2 | MinerU2.5 | 1.2B | 90.67 | 0.047 | 88.22 | 92.38 |
| 3 | OpenDoc-0.1B | 0.1B | 90.49 | 0.043 | 88.05 | 91.97 |
| 4 | MonkeyOCR-pro-3B | 3B | 88.85 | 0.075 | 86.78 | 90.63 |
| 5 | OCRVerse | 4B | 88.56 | 0.058 | 84.55 | 88.45 |
| 6 | dots.ocr | 3B | 88.41 | 0.048 | 86.78 | 90.62 |
| 7 | MonkeyOCR-3B | 3B | 87.13 | 0.075 | 81.39 | 85.92 |
| 8 | Deepseek-OCR | 3B | 87.01 | 0.073 | 84.97 | 88.80 |
| 9 | MonkeyOCR-pro-1.2B | 1.2B | 86.96 | 0.084 | 84.24 | 89.02 |
| 10 | Nanonets-OCR-s | 3B | 85.59 | 0.093 | 80.14 | 85.57 |
| 11 | MinerU2-VLM | 0.9B | 85.56 | 0.078 | 83.54 | 87.66 |
| 12 | Dolphin-1.5 | 0.3B | 83.21 | 0.092 | 78.06 | 84.10 |
| 13 | olmOCR | 7B | 81.79 | 0.096 | 68.92 | 74.77 |
| 14 | POINTS-Reader | 3B | 80.98 | 0.134 | 77.13 | 81.66 |
| 15 | Mistral OCR | — | 78.83 | 0.164 | 70.03 | 78.04 |
| 16 | OCRFlux | 3B | 74.82 | 0.193 | 75.75 | 80.23 |
| 17 | Dolphin | 0.3B | 74.67 | 0.125 | 68.70 | 77.77 |
General Vision-Language Models
| Rank | Model | Size | Overall ↑ | Edit Dist ↓ | Table TEDS ↑ | Formula CDM ↑ |
|---|---|---|---|---|---|---|
| 1 | Qwen3-VL-235B | 235B | 89.15 | 0.069 | 86.21 | 90.55 |
| 2 | Gemini-2.5 Pro | — | 88.03 | 0.075 | 85.71 | 90.29 |
| 3 | Qwen2.5-VL | 72B | 87.02 | 0.094 | 82.15 | 86.22 |
| 4 | InternVL3.5 | 241B | 82.67 | 0.142 | 75.00 | 81.28 |
| 5 | InternVL3 | 78B | 80.33 | 0.131 | 70.64 | 77.74 |
| 6 | GPT-4o | — | 75.02 | 0.217 | 67.07 | 76.09 |
Pipeline Tools
| Rank | Model | Overall ↑ | Edit Dist ↓ | Table TEDS ↑ | Formula CDM ↑ |
|---|---|---|---|---|---|
| 1 | PP-StructureV3 | 86.73 | 0.073 | 81.68 | 89.48 |
| 2 | MinerU2-pipeline | 75.51 | 0.209 | 70.90 | 79.11 |
| 3 | Marker-1.8.2 | 71.30 | 0.206 | 57.88 | 71.17 |
Key takeaways:
- PaddleOCR-VL (0.9B) leads overall at 92.86, showing that compact specialized models can outperform massive general VLMs on document parsing
- Among general VLMs, Qwen3-VL-235B (89.15) and Gemini-2.5 Pro (88.03) compete closely with specialized models
- GPT-4o scores only 75.02 — significantly behind purpose-built document parsers
- Tiny models like OpenDoc-0.1B (90.49) and Dolphin-1.5 (0.3B, 83.21) demonstrate impressive efficiency for their size
Where to Explore the Benchmark
Dashboards and Resources
| Resource | Description | Link |
|---|---|---|
| Official Site | OpenDataLab’s OmniDocBench leaderboard and dataset portal | opendatalab.com/omnidocbench |
| GitHub Repository | Evaluation code, configs, inference scripts, and result tables | github.com/opendatalab/OmniDocBench |
| Hugging Face Dataset | Download the 1,355-page annotated dataset (1.25 GB) | huggingface.co/datasets/opendatalab/OmniDocBench |
| OpenDataLab Dataset | Alternative dataset download | opendatalab.com/OpenDataLab/OmniDocBench |
| arXiv Paper | Full technical paper with methodology and analysis | arxiv.org/abs/2412.07626 |
Load the Dataset
```python
from datasets import load_dataset

dataset = load_dataset("opendatalab/OmniDocBench", split="train")
print(f"Total pages: {len(dataset)}")
# Total pages: 1358
```
Run the Evaluation
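Each page record carries block-level and span-level annotations. The snippet below tallies block categories on a mock record; the field names (`layout_dets`, `category_type`) are illustrative assumptions about the schema, not the verified dataset format — check the repo's docs for the real layout:

```python
from collections import Counter

# Mock page record -- the keys here are assumptions for illustration only.
page = {
    "layout_dets": [
        {"category_type": "text_block", "poly": [60, 80, 520, 80, 520, 140, 60, 140]},
        {"category_type": "table", "poly": [60, 160, 520, 160, 520, 400, 60, 400]},
        {"category_type": "equation_isolated", "poly": [100, 420, 480, 420, 480, 460, 100, 460]},
        {"category_type": "text_block", "poly": [60, 480, 520, 480, 520, 560, 60, 560]},
    ],
}

# Count how many blocks of each category appear on the page.
counts = Counter(block["category_type"] for block in page["layout_dets"])
print(counts.most_common())
# [('text_block', 2), ('table', 1), ('equation_isolated', 1)]
```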
```shell
# Setup
conda create -n omnidocbench python=3.10
conda activate omnidocbench
pip install -r requirements.txt

# Run evaluation with your model's markdown output
python pdf_validation.py --config configs/end2end.yaml
```
The evaluation framework supports flexible configuration files for each task: end2end, md2md, table recognition, formula recognition, OCR, and layout detection.
Understanding the Metrics
Overall Score
The primary end-to-end metric combines three component scores:
$$\text{Overall} = \frac{(1 - \text{Text Edit Distance}) \times 100 + \text{Table TEDS} + \text{Formula CDM}}{3}$$
This gives equal weight to text, table, and formula extraction quality.
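The composite is a direct average once the text term is rescaled to 0–100. A minimal sketch, with made-up component values rather than leaderboard numbers:

```python
def overall_score(text_edit_distance: float, table_teds: float, formula_cdm: float) -> float:
    """v1.5 composite: equal weights for text, tables, and formulas.
    text_edit_distance is in [0, 1]; TEDS and CDM are in [0, 100]."""
    return ((1 - text_edit_distance) * 100 + table_teds + formula_cdm) / 3

# Hypothetical component scores, not taken from the leaderboard:
print(round(overall_score(0.05, 85.0, 90.0), 2))  # (95 + 85 + 90) / 3 = 90.0
```

Note that a text edit distance of 0.05 already costs 5 points on the text term, so strong table and formula scores cannot fully mask weak OCR.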
Component Metrics
| Metric | Range | What It Measures | Used For |
|---|---|---|---|
| Edit Distance | 0–1 ↓ | Character-level differences between predicted and ground-truth text | Text OCR, reading order |
| TEDS | 0–100 ↑ | Tree Edit Distance Similarity — structural + content accuracy of tables | Table recognition |
| CDM | 0–100 ↑ | Character Detection Matching — precision of formula LaTeX transcription | Formula recognition |
| BLEU / METEOR | 0–1 ↑ | Standard NLP metrics for text similarity | Alternative text quality measures |
| mAP / mAR | 0–1 ↑ | COCO detection metrics for bounding box localization | Layout detection |
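Real TEDS computes an edit distance over the full HTML table tree, including cell spans. As a much rougher stand-in that still conveys the structure-plus-content idea, the sketch below parses tables with the stdlib `html.parser` and compares `(row, text)` cell sequences with `difflib` — it is not the TEDS algorithm:

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser

class CellExtractor(HTMLParser):
    """Collect (row_index, cell_text) pairs so both content and coarse
    row structure enter the comparison."""
    def __init__(self):
        super().__init__()
        self.row = -1
        self.cells = []
        self._in_cell = False
    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.row += 1
        elif tag in ("td", "th"):
            self._in_cell = True
            self.cells.append((self.row, ""))
    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False
    def handle_data(self, data):
        if self._in_cell:
            r, text = self.cells[-1]
            self.cells[-1] = (r, text + data.strip())

def rough_table_similarity(pred_html: str, gt_html: str) -> float:
    """Similarity in [0, 1] over (row, text) cell sequences. Not TEDS:
    real TEDS edits the HTML tree and handles rowspan/colspan."""
    def cells(html):
        p = CellExtractor()
        p.feed(html)
        return p.cells
    return SequenceMatcher(None, cells(pred_html), cells(gt_html)).ratio()

gt = "<table><tr><th>Year</th><th>Revenue</th></tr><tr><td>2024</td><td>1.2M</td></tr></table>"
pred = "<table><tr><th>Year</th><th>Revenue</th></tr><tr><td>2024</td><td>1.5M</td></tr></table>"
print(rough_table_similarity(pred, gt))  # 0.75: 3 of 4 cells match
```

A single wrong cell in a 2×2 table costs a quarter of the score here; TEDS behaves analogously but at the granularity of tree nodes.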
Document Types and Attribute Analysis
OmniDocBench goes beyond aggregate scores by providing fine-grained, attribute-level results. You can break down performance by:
- Document type — academic paper, textbook, financial report, newspaper, handwritten note, PPT, magazine, test paper, book
- Layout complexity — single-column, double-column, three-column, complex
- Language — Chinese, English, mixed
- Table attributes — simple vs. complex, with/without merged cells, colored backgrounds
- Text attributes — font size, orientation, special characters
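Attribute-level analysis amounts to grouping per-page scores by their tags. The records below are fabricated for illustration (real runs emit results in the repo's own format); only the grouping logic matters:

```python
from collections import defaultdict

# Fabricated per-page results, tagged with a document-type attribute.
results = [
    {"page_type": "academic_paper", "edit_distance": 0.03},
    {"page_type": "academic_paper", "edit_distance": 0.05},
    {"page_type": "newspaper", "edit_distance": 0.18},
    {"page_type": "handwritten_note", "edit_distance": 0.27},
    {"page_type": "newspaper", "edit_distance": 0.22},
]

# Bucket scores by attribute, then report the mean per bucket.
by_type = defaultdict(list)
for r in results:
    by_type[r["page_type"]].append(r["edit_distance"])

for doc_type, scores in sorted(by_type.items()):
    print(f"{doc_type:17s} mean edit distance = {sum(scores) / len(scores):.3f}")
# academic_paper    mean edit distance = 0.040
# handwritten_note  mean edit distance = 0.270
# newspaper         mean edit distance = 0.200
```

This is exactly the kind of breakdown that exposes a model which aces clean academic papers but collapses on handwritten notes.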
Why OmniDocBench Matters
```mermaid
graph LR
A["LLMs & RAG need<br/>accurate document<br/>parsing"] --> B["Existing benchmarks<br/>too narrow"]
B --> C["OmniDocBench<br/>fills the gap"]
C --> D["Better document AI<br/>for real-world use"]
A2["Models compared<br/>unfairly"] --> B2["Different eval<br/>methodologies"]
B2 --> C
C --> D2["Standardized,<br/>reproducible<br/>benchmarking"]
style A fill:#e74c3c,color:#fff,stroke:#333
style A2 fill:#e74c3c,color:#fff,stroke:#333
style C fill:#27ae60,color:#fff,stroke:#333
style D fill:#3498db,color:#fff,stroke:#333
style D2 fill:#3498db,color:#fff,stroke:#333
```
- Diverse and realistic — 9 document types covering the full range of real-world PDFs, not just academic papers
- Multi-level evaluation — end-to-end, task-specific, and attribute-level analysis to pinpoint model weaknesses
- Fair comparison — standardized evaluation code ensures reproducible, apples-to-apples comparisons
- Covers the full pipeline — text, tables, formulas, layout, and reading order in a single benchmark
- Active community — 1.6k GitHub stars, 14 contributors, regular model additions (Docker support added November 2025)
Conclusion
OmniDocBench 1.5 sets the standard for document parsing evaluation:
- 1,355 PDF pages across 9 document types, 4 layouts, and 3 languages — far broader than any predecessor
- 100,000+ annotations at block and span levels, with reading order and multi-format table/formula ground truth
- Five evaluation dimensions — end-to-end, text OCR, table, formula, and layout detection
- The best specialized model (PaddleOCR-VL) achieves 92.86 Overall — but general VLMs like GPT-4o still score only 75.02, revealing a significant gap
- Accepted at CVPR 2025 and actively maintained with regular model updates
As document AI becomes critical infrastructure for LLMs and RAG systems, OmniDocBench provides the rigorous, multi-dimensional evaluation needed to drive real progress — not just on cherry-picked academic papers, but across the messy diversity of real-world documents.
References
- Ouyang, L., Qu, Y., Zhou, H., Zhu, J., Zhang, R., Lin, Q., Wang, B., Zhao, Z., Jiang, M., Zhao, X., Shi, J., Wu, F., Chu, P., Liu, M., Li, Z., Xu, C., Zhang, B., Shi, B., Tu, Z., He, C. “OmniDocBench: Benchmarking Diverse PDF Document Parsing with Comprehensive Annotations.” CVPR 2025. arxiv.org/abs/2412.07626
- OpenDataLab. “OmniDocBench Dataset.” Hugging Face. huggingface.co/datasets/opendatalab/OmniDocBench
- OpenDataLab. “OmniDocBench GitHub Repository.” github.com/opendatalab/OmniDocBench
- OpenDataLab. “OmniDocBench Official Site.” opendatalab.com/omnidocbench
Read More
- Explore how models handle multimodal understanding beyond documents — see MMMU-Pro
- Evaluate LLMs on scientific figure comprehension — see CharXiv Reasoning
- Track model costs when running evaluations — see FinOps Best Practices for LLM Applications
- Deploy models for running your own evaluations — see Deploying and Serving LLM with vLLM
- OmniDocBench GitHub Repository
- OmniDocBench Dataset on Hugging Face
- OmniDocBench Official Site